Dynamic Monitoring of High-Performance Distributed Applications

Authors

  • Dan Gunter
  • Brian Tierney
  • Keith R. Jackson
  • Jason Lee
  • Martin Stoufer
Abstract

Developers and users of high-performance distributed systems often observe performance problems such as unexpectedly low throughput or high latency. Determining the source of these problems requires detailed end-to-end instrumentation of all components, including the applications, operating systems, hosts, and networks. However, the instrumentation must be designed to have extremely low overhead so that it does not affect the system being monitored. In this paper we present a very lightweight instrumentation system that can be dynamically activated to unobtrusively collect and aggregate detailed end-to-end monitoring information from distributed applications. We also show how emerging "Web Services" can be used to facilitate remote interaction with this system.

Developers and users of high-performance distributed systems often see unexpected performance problems. It can be difficult to track down the cause of these problems because distributed system components interact in complex ways. Bottlenecks can occur in any of the components through which the data flows: the applications, the operating systems, the device drivers, the network interfaces, and/or network hardware such as switches and routers. In previous work we have shown that detailed application monitoring is vital for both performance analysis and application debugging [34][4][33]. In general we have found that performance analysis of distributed systems requires monitoring events before and after every I/O operation. This can generate huge amounts of monitoring data, and great care must be taken to deal with this data in an efficient and unobtrusive manner. In large cross-domain systems such as computational or data Grids, fine-grained mechanisms for dynamic control of the monitoring are also essential.

Consider the use case of monitoring some of the High Energy Physics (HEP) Grid projects [25][3][12] in a Data Grid environment. These projects, which will handle hundreds of terabytes of data, require detailed instrumentation data to understand and optimize their data transfers. For example, the user of a Grid File Replication service [5][38] notices that generating new replicas is taking much longer than it did last week. The user has no idea why performance has changed. Is there a problem in the network, disk, end host, GridFTP server, GridFTP client, or some other Grid middleware such as the authentication or authorization system? Monitoring information is needed to pinpoint the bottleneck and determine what changed to cause it. Current performance must be analyzed and compared against a baseline drawn from previously archived information. This performance analysis requires monitoring data for …
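
The idea of recording events before and after every I/O operation, with monitoring that can be switched on at run time, can be illustrated with a minimal Python sketch. This is not the system the paper presents: the `Monitor` class, the `bracket` helper, the event field names, and the `copy_stream` function are all hypothetical names chosen for illustration, and events are simply buffered in memory rather than collected and aggregated across hosts.

```python
import io
import time
from contextlib import contextmanager


class Monitor:
    """Hypothetical event recorder that can be switched on and off at run time."""

    def __init__(self):
        self.enabled = False  # off by default: the idle cost is one boolean check
        self.events = []      # in-memory buffer (assumption; a real system would forward these)

    @contextmanager
    def bracket(self, name, **fields):
        """Record a start event and an end event around the enclosed operation."""
        if not self.enabled:
            yield
            return
        self.events.append({"event": name + ".start", "ts": time.time(), **fields})
        try:
            yield
        finally:
            self.events.append({"event": name + ".end", "ts": time.time(), **fields})


def copy_stream(src, dst, monitor, block_size=4):
    """Copy src to dst block by block, bracketing each read and each write."""
    total = 0
    while True:
        with monitor.bracket("read", offset=total):
            block = src.read(block_size)
        if not block:
            break
        with monitor.bracket("write", offset=total, size=len(block)):
            dst.write(block)
        total += len(block)
    return total


monitor = Monitor()

# Monitoring disabled: the copy runs but no events are recorded.
copy_stream(io.BytesIO(b"abcdefgh"), io.BytesIO(), monitor)
quiet_count = len(monitor.events)

# Activate monitoring dynamically, without changing the application code.
monitor.enabled = True
copied = copy_stream(io.BytesIO(b"abcdefgh"), io.BytesIO(), monitor)
```

Subtracting the timestamps of matching start and end events gives the time spent in each read or write. Comparing those durations against an archived baseline is one way to approach the "what changed since last week" question raised above.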

Similar resources

High-performance Event Filtering for Distributed Dynamic Multi-point Applications: Survey and Evaluation

High-performance event filtering is an essential service in a distributed systems environment. We are developing an event filtering architecture to efficiently process the large volume of event traffic generated by distributed dynamic multi-point (DDMP) applications (such as automated monitoring and fault management in distributed systems). Our architecture supports the dynamic (re)configuratio...

Hierarchical Filtering-based Monitoring System for Large-scale Distributed Applications

On-line monitoring of large-scale distributed (LSD) applications is an effective means to observe the applications' behavior at run-time and provide status information required by debugging and management tools. In this paper, we describe and motivate the architecture and the components design of a scalable, high-performance, dynamic and non-intrusive monitoring system for LSD applications. The...

CloudKon Reloaded with Efficient Monitoring, Bundled Responses, and Dynamic Provisioning

In today's world the emphasis is on distributed systems, which play an important role in achieving good performance, high system utilization and scalability. Task scheduling and execution over large-scale distributed systems play an important role in achieving good performance and high system utilization. Due to the explosion of parallelism found in today's hardware, applications need to per...

Green Energy-aware task scheduling using the DVFS technique in Cloud Computing

Nowadays, energy consumption has become a critical issue in high-performance distributed computing systems, so green computing tries to reduce energy consumption, carbon footprint and CO2 emissions in high performance computing systems (HPCs) such as clusters, Grid and Cloud that a large number of parallel. Reducing energy consumption for high end computing can bring various benefits such as red...

High-performance Monitoring Architecture for Large-scale Distributed Systems Using Event Filtering

Monitoring is an essential process to observe and improve the reliability and the performance of large-scale distributed (LSD) systems. In an LSD environment, a large number of events is generated by the system components during its execution or interaction with external objects (e.g. users or processes). Monitoring such events is necessary for observing the run-time behavior of LSD systems and...

Publication date: 2002